[UR][L0v2] Migrate discrete buffer through host when P2P is not accessible by ldorau · Pull Request #22010 · intel/llvm

ldorau · 2026-05-13T11:18:14Z

When a buffer on a discrete GPU needs to be accessed from a different
device and P2P access is not enabled, migrate the data through a USM
HOST staging buffer instead of returning UR_RESULT_ERROR_UNSUPPORTED_FEATURE.

The migration uses a two-step copy:

1. Synchronous device->host copy using the source device's own command
   list (the destination device cannot reach source device memory
   without P2P).
2. Async host->device copy enqueued on the caller's command list (host
   memory is accessible by all devices, so this is safe).
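
The two-step copy can be illustrated with a small host-only model. This is a sketch under stated assumptions, not the adapter's code: `FakeCommandList`, `Bytes`, and `migrateThroughHost` are hypothetical names, and device memory is modelled as a plain `std::vector<char>`.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Level Zero command list: commands are
// recorded and only run when the host synchronizes.
struct FakeCommandList {
  std::queue<std::function<void()>> pending;
  void append(std::function<void()> cmd) { pending.push(std::move(cmd)); }
  void hostSynchronize() {
    while (!pending.empty()) {
      pending.front()();
      pending.pop();
    }
  }
};

using Bytes = std::vector<char>; // models a device or host allocation

// Step 1: synchronous device->host copy on the source device's own list.
// Step 2: asynchronous host->device copy on the caller's list.
// `staging` must outlive the async copy, so the caller owns it.
inline void migrateThroughHost(const Bytes &srcDevice, Bytes &dstDevice,
                               Bytes &staging, FakeCommandList &srcList,
                               FakeCommandList &callerList) {
  const Bytes *src = &srcDevice;
  Bytes *stg = &staging;
  Bytes *dst = &dstDevice;

  stg->resize(src->size());
  srcList.append([src, stg] { *stg = *src; });
  srcList.hostSynchronize(); // step 1 has completed when this returns

  dst->resize(stg->size());
  callerList.append([stg, dst] { *dst = *stg; }); // step 2 stays pending
}
```

In this model the destination only holds the data after the caller's list is synchronized, which is why the staging buffer has to stay alive past the return of the migration call.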

Before the device->host copy, any pending operations on the caller's
command list are ordered and drained via zeCommandListAppendWaitOnEvents +
zeCommandListHostSynchronize, ensuring prior kernel writes to the
source buffer are visible. A fully synchronous fallback is used when
no command list is available (e.g. urMemGetNativeHandle).
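
A minimal single-threaded model of that ordering step follows. It is an illustration only: `FakeEvent`, `FakeCommandList`, and `drainBeforeStagingCopy` are hypothetical names, and a real wait would block rather than check a flag.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical event: signaled by work on some other queue.
struct FakeEvent {
  bool signaled = false;
};

// Hypothetical command list that runs recorded commands in order.
struct FakeCommandList {
  std::vector<std::function<void()>> pending;

  // Models zeCommandListAppendWaitOnEvents: commands appended later are
  // ordered after these events. In this single-threaded model the wait is
  // reduced to checking that the events are already signaled.
  void appendWaitOnEvents(std::vector<const FakeEvent *> events) {
    pending.push_back([events = std::move(events)] {
      for (const FakeEvent *e : events)
        assert(e->signaled);
    });
  }

  // Models zeCommandListHostSynchronize with an unbounded timeout.
  void hostSynchronize() {
    for (auto &cmd : pending)
      cmd();
    pending.clear();
  }
};

// Drain the caller's list so that writes already recorded on it, and the
// work guarded by the wait events, are visible before the staging copy.
inline void
drainBeforeStagingCopy(FakeCommandList &callerList,
                       const std::vector<const FakeEvent *> &waitEvents) {
  if (!waitEvents.empty())
    callerList.appendWaitOnEvents(waitEvents);
  callerList.hostSynchronize();
}
```

The point of the model is that waiting on the events alone would leave work already recorded on the caller's list unflushed; synchronizing the list covers both.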

Only one staging buffer is kept alive at a time: it is released at the
start of the next migration after zeCommandListHostSynchronize confirms
the previous async copy has completed.
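
The single-staging-buffer lifetime rule could look roughly like the sketch below. `StagingSlot` and its `acquire` method are hypothetical, and the `syncPreviousCopy` callback stands in for the zeCommandListHostSynchronize call mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical holder that keeps at most one staging buffer alive.
class StagingSlot {
public:
  // Returns a fresh staging buffer of `size` bytes. If a previous buffer
  // exists, `syncPreviousCopy` runs first so the async copy that reads
  // from it has finished before the buffer is released.
  template <typename SyncFn>
  std::vector<char> &acquire(std::size_t size, SyncFn &&syncPreviousCopy) {
    if (buffer_) {
      syncPreviousCopy();
      buffer_.reset();
    }
    buffer_ = std::make_unique<std::vector<char>>(size);
    return *buffer_;
  }

  bool hasBuffer() const { return buffer_ != nullptr; }

private:
  std::unique_ptr<std::vector<char>> buffer_;
};
```

The first migration never synchronizes because there is nothing to release; every later one pays for one synchronization before reusing the slot.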

A new ensureDeviceAlloc helper allocates the destination device buffer
without the activeAllocationDevice side-effect of allocateOnDevice,
so the active-device state is only updated after the async copy is
successfully enqueued.
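
The state-update ordering can be sketched as follows. Only the names `ensureDeviceAlloc` and `activeAllocationDevice` come from the description; `DiscreteBuffer`, `migrateTo`, the integer device ids, and the function bodies are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <vector>

// Hypothetical, simplified buffer with one allocation per device id.
struct DiscreteBuffer {
  int activeAllocationDevice = 0;
  std::map<int, std::vector<char>> allocations;

  // Allocates on `device` without touching activeAllocationDevice.
  std::vector<char> &ensureDeviceAlloc(int device, std::size_t size) {
    std::vector<char> &alloc = allocations[device];
    if (alloc.size() != size)
      alloc.resize(size);
    return alloc;
  }

  // `enqueueCopy` returns false if the async copy could not be enqueued.
  template <typename EnqueueFn>
  void migrateTo(int device, std::size_t size, EnqueueFn &&enqueueCopy) {
    std::vector<char> &dst = ensureDeviceAlloc(device, size);
    if (!enqueueCopy(dst))
      throw std::runtime_error("enqueue failed");
    // Committed only after the copy was enqueued successfully.
    activeAllocationDevice = device;
  }
};
```

If the enqueue fails, the buffer still reports the old device as active, so a later access does not read from a destination allocation that was never filled.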

Fixes: #22007
Fixes: #22008

ldorau · 2026-05-14T07:24:55Z

Please review @intel/unified-runtime-reviewers-level-zero

mateuszpn · 2026-05-14T09:35:16Z

+    // Migrate buffer through the host: copy from the current device to a
+    // temporary host buffer, then from host to the target device.
+    auto bufferSize = getSize();
+    std::vector<char> hostBuf(bufferSize);


nit: it may be worth considering a USM allocation instead of a heap allocation, as in line 100

pbalcer · 2026-05-14T10:39:08Z

+    for (uint32_t i = 0; i < waitListView.num; i++) {
+      ZE2UR_CALL_THROWS(zeEventHostSynchronize,
+                        (waitListView.handles[i], UINT64_MAX));
+    }


I don't think this will work. The operation also needs to be ordered with regard to the command list itself, so something like this will be better:

    if (numWaitEvents > 0) {
      ZE2UR_CALL(zeCommandListAppendWaitOnEvents,
                 (zeCommandList.get(), numWaitEvents, pWaitEvents));
    }
    ZE2UR_CALL(zeCommandListHostSynchronize,
               (zeCommandList.get(), UINT64_MAX));

pbalcer · 2026-05-14T10:44:20Z

+    auto bufferSize = getSize();
+    std::vector<char> hostBuf(bufferSize);
+
+    UR_CALL_THROWS(synchronousZeCopy(hContext, activeAllocationDevice,


I don't like the fact that this is synchronous. Can you explore what it would take to make it async? I think we'd need to keep the allocation somewhere.

Changed. Is it OK now?

ldorau · 2026-05-15T11:50:10Z

@mateuszpn @pbalcer re-review please

ldorau · 2026-05-18T09:08:13Z

@mateuszpn @pbalcer re-review please

…sible

When a buffer on a discrete GPU needs to be accessed from a different device and P2P access is not enabled, migrate the data through a USM HOST staging buffer instead of returning UR_RESULT_ERROR_UNSUPPORTED_FEATURE.

The migration uses a two-step copy:

1. Synchronous device->host copy using the source device's own command list (the destination device cannot reach source device memory without P2P).
2. Async host->device copy enqueued on the caller's command list (host memory is accessible by all devices, so this is safe).

Before the device->host copy, any pending operations on the caller's command list are ordered and drained via zeCommandListAppendWaitOnEvents + zeCommandListHostSynchronize, ensuring prior kernel writes to the source buffer are visible. A fully synchronous fallback is used when no command list is available (e.g. urMemGetNativeHandle).

Only one staging buffer is kept alive at a time: it is released at the start of the next migration after zeCommandListHostSynchronize confirms the previous async copy has completed.

A new ensureDeviceAlloc helper allocates the destination device buffer without the activeAllocationDevice side-effect of allocateOnDevice, so the active-device state is only updated after the async copy is successfully enqueued.

Fixes: intel#22007
Fixes: intel#22008

Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>

Add four conformance tests exercising discrete buffers accessed from two different device queues when P2P access is not available.

Tests covering the async migration path (cmdList != nullptr, triggered by urEnqueueMem* operations):

- AsyncFillThenReadOnSecondQueueWithWait: fills a buffer on queues[0] and reads it on queues[1] using an explicit event dependency.
- PingPongFillBetweenTwoDeviceQueues: alternates fills between queues[0] and queues[1], each read on the opposite queue using event dependencies.
- ChainedAsyncOpsAcrossQueuesWithEvents: chains fill, blocking write, and read across two queues using cross-queue events.

Test covering the synchronous fallback path (cmdList == nullptr, triggered by urMemGetNativeHandle):

- SyncFallbackMigrationViaNativeHandle: fills the buffer on device 0, calls urMemGetNativeHandle for device 1 to trigger synchronous host-staged migration, then verifies the data on device 1.

All tests add an explicit queues.size() < 2 guard (GTEST_SKIP) in case the fixture minimum-device requirement changes, and cross-queue ordering is expressed with events throughout to properly exercise the async migration path.

A dedicated L0 v2 adapter runner (discrete_buffer_host_migration.cpp) reuses the conformance test source under UR_LOADER_USE_LEVEL_ZERO_V2.

Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>

The test was intermittently failing on CI hardware because the queue create + USM fill + urQueueFinish sequence before the memory measurement introduced a multi-millisecond time window. During that window, async driver cleanup from earlier P2P tests (which can fail to evict peer residency via zeContextEvictMemory) or concurrent GPU workloads on shared CI machines could change devices[1]'s GLOBAL_MEM_FREE reading enough to trigger the assertion.

The queue/fill/finish operations are not needed to test the residency property: zeContextMakeMemoryResident is invoked at urUSMDeviceAlloc time, so measuring immediately after the allocation captures any peer-residency side-effects without a blocking GPU operation in between. Remove those operations to keep the measurement window as short as possible, matching the pattern already used in allocationInitiallyAbsentOnPeer.

Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>

ldorau · 2026-05-20T07:16:21Z

@mateuszpn @pbalcer re-review please

pbalcer · 2026-05-21T07:33:12Z

what is this file for?

pbalcer · 2026-05-21T07:34:38Z

when can cmdList be null?

pbalcer · 2026-05-21T07:41:50Z

this isn't needed.

pbalcer · 2026-05-21T07:42:04Z

do this copy on cmdList

pbalcer · 2026-05-21T07:42:19Z

do sync after this cmdlist.

pbalcer · 2026-05-21T07:42:34Z

with the sync after final copy, migration staging buffer isn't needed.

ldorau requested a review from a team as a code owner May 13, 2026 11:18

ldorau mentioned this pull request May 13, 2026

enqueue-test/urEnqueueMemBuffer* tests often fail on UR L0v2 adapter #22008

Open

ldorau requested a review from kswiecicki May 13, 2026 12:29

mateuszpn reviewed May 14, 2026

pbalcer reviewed May 14, 2026

ldorau requested review from mateuszpn and pbalcer May 15, 2026 11:49

ldorau force-pushed the URL0_Migrate_discrete_buffer_through_host_when_P2P_is_not_accessible branch 2 times, most recently from 9727548 to 1e9d552 Compare May 19, 2026 07:16

ldorau changed the title ~~[UR][L0] Migrate discrete buffer through host when P2P is not accessible~~ [UR][L0v2] Migrate discrete buffer through host when P2P is not accessible May 19, 2026

ldorau requested a review from a team as a code owner May 19, 2026 08:57

ldorau added 2 commits May 19, 2026 09:59

pbalcer reviewed May 21, 2026

ldorau marked this pull request as draft May 21, 2026 09:31
